Abstract. With the dissemination of affordable parallel and distributed
hardware, parallel and distributed constraint solving has lately been the
focus of some attention. To apply the power of distributed computational
systems effectively, the work involved in the search for a solution to a
Constraint Satisfaction Problem (CSP) must be shared among all the
participating agents, and this must happen dynamically, since it is hard to
predict the effort associated with the exploration of some part of the search
space. We describe and provide an initial experimental assessment of an
implementation of a work stealing-based approach to distributed CSP solving.



1 Introduction 

Constraints are used to model problems with no known polynomial algorithm, 
but for which search techniques developed within the field of constraint program- 
ming provide viable procedures. Besides classical applications, such as planning 
and scheduling, constraints have recently been successfully applied in the con- 
texts of bioinformatics [1] and computer network monitoring [12]. 

Notwithstanding their relative efficiency, constraint solving methods are com- 
putationally demanding and good candidates to benefit from multiprocessing. 
Moreover, the declarative style of constraint programming frees the programmer 
from concerns usually entailed by parallel and distributed programming, such 
as control, synchronisation, and communication issues. In fact, the programmer 
may not even be aware that there is any parallelism involved in solving the 
problem. 

Given the increasing availability of parallel computational resources, in the
form of multiprocessors, clusters of computers, or both, there is a need for an
effective way to incorporate that power into the constraint programming
setting. In this context, our goal is to build a library which takes advantage of
parallel hardware in a transparent way, for constraint solving.

In parallel constraint solving [5, 15, 13, 8, 3], the problem is partitioned around
the domains of the variables, effectively partitioning its search space. The search 
for a solution is then carried out in each of the sub-search spaces by one agent 
(or worker), all agents working in parallel. 



Constraint solving involves exploring large search spaces. To perform search 
using several agents in parallel, the search effort must be shared among them. 
This may happen either by having each agent do a part of the work and co- 
ordinate with the other agents, in order to reach the intended goal (which is 
the approach taken in solving Distributed CSPs [17]), or the agents may be 
mostly independent from each other, performing their (non-overlapping) part of 
the work, hoping that one of them will find a quicker path to an answer. While 
the first approach typically requires significant inter-agent communication, not 
only for the search to progress but also for termination detection, in the latter 
communication can be limited to an initial dispatching of the agents and to 
an answer collecting phase at the end of the procedure. In this case, however, 
the initial work distribution may turn out to be quite unbalanced, leaving some 
agents to bear most of the effort as others become idle and their contribution is 
wasted. 

This article reports on preliminary results of our experiments in implementing 
a work-stealing scheme for overcoming the effect described above. This is a two- 
level scheme: work stealing occurs between co-located agents, but when distant 
agents are involved, some cooperation is needed to redistribute the work still 
left. 

The remainder of this paper is structured as follows: we start by establishing 
some terminology in the next section. Then, in Sections 3 and 4 we describe the 
architecture of the implemented solver and report on some experimental results 
obtained with it. Section 5 discusses related work and in Section 6 we conclude 
and put forward possible continuation paths for this work. 

2 Constraint Solving 

A constraint satisfaction problem can be briefly defined as a set of variables 
whose values, to be drawn from their domains, must satisfy a set of relations. 

Definition 1 (CSP). A Constraint Satisfaction Problem (CSP) over finite
domains is a triple $P = (X, D, C)$, where

- $X = \{x_1, x_2, \ldots, x_n\}$ is an indexed set of variables;

- $D = \{D_1, D_2, \ldots, D_n\}$ is an indexed set of finite sets of values,
with $D_i$ being the domain of variable $x_i$, for every $i = 1, 2, \ldots, n$; and

- $C = \{c_1, c_2, \ldots, c_m\}$ is a set of relations between variables,
called the constraints.

The search space of a CSP consists of all the tuples from the cross product of 
the domains, where each variable is assigned a value from its domain. Solving a 
CSP amounts to finding some or all of those tuples which satisfy all constraints 
of the problem. 

Definition 2 (Solution). A solution to a CSP is an $n$-tuple
$(v_1, v_2, \ldots, v_n) \in D_1 \times D_2 \times \cdots \times D_n$ such that
all constraints are satisfied.
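
As an illustration of Definitions 1 and 2, the following minimal sketch (in Python, chosen here only for brevity; the names `solutions`, `domains`, and `constraints` are ours and are not part of the solver) enumerates the search space of a toy CSP as the cross product of the domains and keeps the tuples which satisfy every constraint. It is a brute-force reading of the definitions, not the propagation-based search described later.

```python
from itertools import product

def solutions(domains, constraints):
    """Enumerate D_1 x ... x D_n and keep the tuples satisfying all constraints.

    Each constraint is modelled (as an assumption of this sketch) as a
    predicate over a complete tuple of values.
    """
    return [t for t in product(*domains) if all(c(t) for c in constraints)]

# Toy CSP: three variables over {1, 2, 3}, with x_1 < x_2 and x_2 < x_3.
domains = [[1, 2, 3]] * 3
constraints = [lambda t: t[0] < t[1], lambda t: t[1] < t[2]]
result = solutions(domains, constraints)
```

Only one of the 27 tuples of the search space survives in this example.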



In parallel constraint solving, the problem is divided into subproblems. So- 
lutions to these subproblems are also solutions to the original problem. 

Definition 3 (Subproblem). A subproblem of a CSP $P = (X, D, C)$ is a
CSP $P' = (X, D', C)$ such that $D' = \{D'_1, D'_2, \ldots, D'_n\}$ and
$D'_i \subseteq D_i$, for every $i = 1, 2, \ldots, n$.

To guarantee completeness of the search, the search spaces of the subproblems 
must cover the search space of the original problem. In order to avoid redundant 
work, they must also be pairwise disjoint. 

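Definition 4 (Partition). A set $\{P'_1, P'_2, \ldots, P'_k\}$ of subproblems of
a CSP $P$, with $P'_i = (X, \{D'_{i1}, D'_{i2}, \ldots, D'_{in}\}, C)$, is a
partition of $P$ if

$$\bigcup_{1 \le i \le k} D'_{i1} \times D'_{i2} \times \cdots \times D'_{in} = D_1 \times D_2 \times \cdots \times D_n$$

and

$$(\forall i \ne j) \quad (D'_{i1} \times D'_{i2} \times \cdots \times D'_{in}) \cap (D'_{j1} \times D'_{j2} \times \cdots \times D'_{jn}) = \emptyset.$$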

A partition of a CSP may be dually regarded as a partition of its search 
space, the search spaces of the subproblems being sub-search spaces of the orig- 
inal problem. In this paper we will only deal with search space partitions that 
correspond to some partition of a problem. 
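
The two conditions of Definition 4 can be checked by brute force on small examples. The sketch below is only illustrative: `search_space` and `is_partition` are hypothetical helper names, and a problem is represented simply by its list of domains, since variables and constraints are shared by all subproblems.

```python
from itertools import product

def search_space(domains):
    """Set of all tuples in the cross product of the given domains."""
    return set(product(*domains))

def is_partition(original, subproblems):
    """Brute-force check of Definition 4: the sub-search spaces must cover the
    original search space and must be pairwise disjoint."""
    spaces = [search_space(d) for d in subproblems]
    union = set().union(*spaces)
    covers = union == search_space(original)
    # Sizes add up to the size of the union only if no tuple is shared.
    disjoint = sum(len(s) for s in spaces) == len(union)
    return covers and disjoint

original = [{1, 2, 3}, {1, 2}]
good = [[{1}, {1, 2}], [{2, 3}, {1, 2}]]            # splits the first domain
overlapping = [[{1, 2}, {1, 2}], [{2, 3}, {1, 2}]]  # value 2 appears twice
```

Here `good` is a partition of `original`, whereas `overlapping` covers the search space but would lead to redundant work.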



3 Solver Architecture 

Our constraint solver consists of workers, grouped together as teams (Figure 1). 
The search for one or all solutions is carried out by the workers, which implement 
a propagator based constraint solving engine, following a domain consistency 
oriented approach [2]. Each active worker has a pool of idle search spaces and a
current search space, the one it is currently exploring. In each team there is a 
controller, which does not participate in the search, and one of the controllers, 
the main controller, also coordinates the teams. 
Fig. 1. Solver architecture



Structuring the workers this way serves two purposes. The first is that a
worker's sole task becomes searching, as all communication with the environment
required by the dynamic sharing of work among teams is handled by the
controller. The second is the sharing of resources enabled by binding
the workers in a team close together. If all workers were on the same level, they 
would either have to divide their attention between search and communication 
or there would have to be one controller per worker, thereby increasing resource 
usage. On the other hand, this structure matches naturally a two-level partition- 
ing of the search space and we obtain receiver-initiated decentralised dynamic 
load balancing [16]. 

At the outset of the search process, the problem to be solved is partitioned 
and each team is entrusted with trying to solve one of the resulting subproblems. 
The controller in each team then partitions the local problem and hands each 
sub-search space over to a worker for exploration. 

On finishing exploring its assigned search space, a worker tries to steal work 
from another worker within its team. If unsuccessful, it then notifies the team 
controller that it has become idle. When all the workers in a team are idle, the 
controller asks the other teams for more work. 

3.1 Partitioning Strategies 

The strategy used to partition the search space has a decisive impact on the 
number of steps needed to get to a solution, hence on performance. 

Partitioning strategies may be designed either to lead to a balanced dis- 
tribution of the search work, like the even strategy below and the prime and 
greedy strategies from [14], or to produce some subproblems where the search 
is expected to be quick (while others may be slow), such as eager partitioning. 
In principle, the former strategies will be more suited to situations where all 
solutions are requested and the whole search space must be visited, and the 
latter will lend themselves better to when one solution is enough. In any case, 
the splitting of the problem will introduce a breadth-first component into the 
usual depth-first exploration of the search tree, which sometimes gives rise to 
superlinear speedups. 

In even partitioning, domains are split so as to obtain sub-search spaces of 
similar dimensions. If we want to split a problem into k subproblems, then the 
first variable with at least that many values in its domain is chosen and its 
domain is split as evenly as possible among the subproblems: if the domain 
of the chosen variable has $d \ge k$ values, then it will have $\lfloor d/k \rfloor$
values in the first $k - (d \bmod k)$ subproblems and $\lfloor d/k \rfloor + 1$
values in the remaining $d \bmod k$ subproblems.
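
The arithmetic of even partitioning can be sketched as follows. The function name `even_split` is ours, and the choice of handing out consecutive runs of values is an assumption of this sketch; the text only fixes the sizes of the parts.

```python
def even_split(values, k):
    """Split a domain of d >= k values into k chunks: the first k - (d mod k)
    chunks get floor(d/k) values, the remaining d mod k chunks get one more."""
    d = len(values)
    if d < k:
        raise ValueError("domain has fewer values than requested parts")
    small, extra = divmod(d, k)          # small = floor(d/k), extra = d mod k
    chunks, start = [], 0
    for j in range(k):
        size = small if j < k - extra else small + 1
        chunks.append(values[start:start + size])
        start += size
    return chunks

parts = even_split(list(range(1, 11)), 4)   # d = 10, k = 4
```

With $d = 10$ and $k = 4$, the first two parts receive 2 values each and the last two receive 3 values each.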

Eager partitioning corresponds roughly to a partial breadth-first expansion 
of the search tree and it will mostly produce subproblems where at least one 
of the variables has had its domain reduced to a single value. The splitting is 
performed according to the algorithm depicted in Figure 2, whose inputs are 
the number of subproblems to create and a sequence of problems from which to 
create them. Initially, this sequence only contains the original problem. 



Notation. If $P$ is a CSP and $D$ is a finite set, $P\,D_i$ stands for the CSP
which is identical to $P$ except that the domain of the $i$th variable is $D$.

eager-split(k, (P_1 P_2 ... P_q))
    (X, D, C) <- P_1
    i <- min{j | |D_j| > 1}
    d <- |D_i|
    {v_1, v_2, ..., v_d} <- D_i
    if k < d then
        (P_1{v_1}_i P_1{v_2}_i ... P_1{v_k, ..., v_d}_i P_2 ... P_q)
    else
        eager-split(k - d + 1, (P_2 ... P_q P_1{v_1}_i P_1{v_2}_i ... P_1{v_d}_i))

Fig. 2. Eager partitioning algorithm



Figure 3 shows the result of applying eager partitioning to split into six a 
problem where the domain of all variables is {a, b, c}. (Only the variables whose 
domains are affected by the splitting are shown.) 




Fig. 3. Eager partition into 6 sub-search spaces 
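
The following sketch is one possible transcription of the algorithm of Figure 2, useful for reproducing the six-way split just described. It is an illustration only: a problem is reduced to its list of domains (tuples of values), the name `eager_split` and the helper `restrict` are ours, and the sketch assumes that the first problem of the sequence always has a variable with more than one value.

```python
def eager_split(k, problems):
    """Split the sequence of problems (P_1 ... P_q) following Figure 2.

    A problem is a list of domains, one tuple of values per variable.
    Assumes P_1 has some variable whose domain is not a singleton.
    """
    first, rest = problems[0], problems[1:]
    i = next(j for j, dom in enumerate(first) if len(dom) > 1)
    values = first[i]
    d = len(values)

    def restrict(dom):
        # The problem identical to P_1 except that variable i has domain `dom`.
        return first[:i] + [tuple(dom)] + first[i + 1:]

    if k < d:
        # k - 1 singleton subproblems, plus one holding the remaining values.
        head = [restrict([v]) for v in values[:k - 1]]
        return head + [restrict(values[k - 1:])] + rest
    # Otherwise fully expand variable i and keep splitting the next problem.
    singletons = [restrict([v]) for v in values]
    return eager_split(k - d + 1, rest + singletons)

abc = ("a", "b", "c")
parts = eager_split(6, [[abc, abc]])   # two variables shown, as in Figure 3
```

In this run, five of the six resulting subproblems have at least one singleton domain on both variables shown, and the sizes of the six sub-search spaces add up to the nine tuples of the original search space.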



The partitioning of the CSP may affect the behaviour of the search, even 
to the point of defeating the variable and value selection heuristics which are 
usually appropriate to a given problem, as has been noted in [8, Section 6]. This 
suggests that the partitioning strategy, introducing another degree of freedom 
in the search strategy, needs to be adapted to the problem being solved and 
matched with the search heuristics used, and that no overall 'best' partitioning 
strategy exists. (Notice that, for the present, problem specific heuristics do not 
inform problem partitioning.) 

As problem partitioning takes place at two points in the process (to distribute
work to all the teams and, initially within every team, to assign work to
each worker), different splitting strategies can be used: a more balanced one to
allot similar amounts of work to the individual teams, and another to focus the
efforts of the agents. The latter strategy could be finer grained than the former,
the cost of local work stealing being much lower than that of network supported
work sharing.



Additionally, in parallel search, different teams might split their problems 
differently, allowing us to take advantage of one not yet identified strategy being 
more effective than the others for the problem at hand. 



3.2 Search 

The search unfolds as a worker further splits the search space it is working on, 
keeping one part as its current search space and adding the other to its pool of 
idle search spaces. If the current search space is found to contain no solution, 
the worker draws a new search space from the pool and starts exploring it, never 
backtracking. Upon finding a solution, the worker communicates it to the team 
controller which, in turn, forwards it to the main controller. 

The state of a worker with two search spaces currently in the pool is shown 
in Figure 4, where solid edges mean that the child search spaces form a partition 
of the parent. Notice that the subtree to the left of the current search space 
(corresponding to the tuples where both $x_1$ and $x_2$ take value 1) has already
been explored and discarded, and is not displayed. 
Fig. 4. Search spaces from a worker



Figure 5 depicts the main driver algorithm for workers. At each step of the 
search process, a worker starts by looking within its current search space for 
a variable whose domain is not a singleton (line 3). If none is found, then the 
search space contains a single tuple which constitutes a solution to the problem, 
and which is returned by the worker (line 10). Otherwise, one of the variables 
with a non-singleton domain is selected and the current search space is split into 
two subspaces (line 4): 

— In the first, which will become the worker's current search space, the selected 
variable is set to an individual value picked from its domain. 

— In the other, to be added to the pool of idle search spaces (line 5), that value 
is removed from the domain of the variable. 



The domains of the other variables remain unchanged in both search spaces. 

Following the split, the new current search space goes through a propagation
phase (line 6). If it succeeds, another search step is performed. If the
propagation fails, the worker tries to fetch an idle search space from the pool to
become the current search space (line 7). If this is not possible, the worker
fails (line 9); otherwise the search resumes with the retrieved search space
undergoing a propagation phase, as the domain of one of its variables shrank
just prior to it being stored in the idle pool.

 1: WORKER(search-space)
 2:   current <- search-space
 3:   while var <- select-variable(current) do
 4:     (current, other) <- split-search-space(var, current)
 5:     pool-put(other, var)
 6:     while (current <- revise(var, current)) = FAIL do
 7:       (current, var) <- pool-get()
 8:       if current = FAIL then
 9:         return FAIL
10:   return SOLUTION(current)

Fig. 5. Worker main driver algorithm 
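
A simplified, single-worker rendering of the driver of Figure 5 is sketched below. It is not the solver's code: search spaces are plain dictionaries from variables to lists of values, the pool is a local stack with no work stealing, and the `consistent` argument is a stand-in for the propagation phase that merely rejects inconsistent search spaces instead of pruning domains. The names `worker` and `no_attack` are ours.

```python
def worker(search_space, consistent):
    """Explore `search_space` without backtracking, using a pool of idle spaces.

    Returns a search space whose domains are all singletons (a solution),
    or None when the work is exhausted. Assumes the initial space has
    already been checked for consistency.
    """
    pool = []                       # idle search spaces, used as a stack
    current = search_space
    while True:
        var = next((v for v, dom in current.items() if len(dom) > 1), None)
        if var is None:
            return current          # every domain is a singleton
        value, others = current[var][0], current[var][1:]
        pool.append({**current, var: others})    # value removed from the domain
        current = {**current, var: [value]}      # variable set to the value
        while not consistent(current):
            if not pool:
                return None         # nothing left to explore
            current = pool.pop()    # the retrieved space is checked as well

def no_attack(space):
    """Toy check for queens: no two placed queens share a row or a diagonal."""
    placed = [(c, dom[0]) for c, dom in space.items() if len(dom) == 1]
    return all(r1 != r2 and abs(r1 - r2) != abs(c1 - c2)
               for n, (c1, r1) in enumerate(placed)
               for (c2, r2) in placed[n + 1:])

queens = {c: [0, 1, 2, 3] for c in range(4)}
solution = worker(queens, no_attack)
```

Because the pool is used as a stack and values are tried in domain order, this sketch visits the search tree in the same order as a chronological backtracking search would.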



3.3 Work Stealing 

When a worker tries to fetch a new search space from its pool and finds it 
empty, it will attempt to obtain one from one of its teammates. In order to 
minimise the impact on the performance of the solver, this is achieved with as 
little cooperation from the holder of the retrieved search space as possible. In 
fact, the idle worker will effectively steal work from a teammate while the latter 
continues its task, oblivious to what is being done to its work queue. 

The intended discipline of a worker's pool is that of a deque (double-ended 
queue), as depicted in Figures 6 and 7. While the owner works on one end of 
its pool (lines 2, 8, and 12), a worker whose pool is empty will remove an entry 
from the other end (line 20). This way, the only penalty a worker incurs during 
normal processing is the cost of an extra check on the size of its pool (line 6). 
The protocol used to avoid interference during pool accesses is similar to the one 
in [6]. Mutual exclusion in the accesses to the pool needs to be enforced only
when the number of entries in it is small, and even then only when
removing a search space. To reduce contention, work stealing is only allowed
from a pool when the number of entries in it reaches a given threshold (line 17). 
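
The pool discipline described above can be sketched as follows. This is a structural illustration under several assumptions: the class `Pool`, the function `steal_work`, and the values given to `SAFE_SIZE` and `THRESHOLD` are ours, and Python's `collections.deque` already provides thread-safe appends and pops at both ends, so the sketch mirrors the shape of the protocol rather than the low-level concerns of a shared-memory implementation.

```python
import threading
from collections import deque

SAFE_SIZE = 4    # hypothetical: below this size the owner also takes the lock
THRESHOLD = 2    # hypothetical: minimum pool size that allows stealing

class Pool:
    def __init__(self):
        self.items = deque()
        self.lock = threading.Lock()

    def put(self, space):
        self.items.append(space)            # owner's end of the deque

    def get(self):
        """Owner side: remove from the owner's end, locking only when the pool
        is small. Returns None when empty (the caller would then try to steal)."""
        if not self.items:
            return None
        if len(self.items) < SAFE_SIZE:
            with self.lock:
                return self.items.pop() if self.items else None
        return self.items.pop()

    def steal(self):
        """Thief side: remove from the opposite end, if the pool is big enough."""
        with self.lock:
            if len(self.items) < THRESHOLD:
                return None
            return self.items.popleft()

def steal_work(pools, steal_lock):
    """Try to steal the oldest entry from the teammate with the biggest pool."""
    with steal_lock:
        victim = max(pools, key=lambda p: len(p.items))
        return victim.steal()
```

The owner and the thief operate on opposite ends of the deque, so the owner keeps working on the most recently created search spaces while a thief receives the oldest one.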

Stolen work corresponds to locations nearer the root of a worker's search tree.
The search within the worker's search space proceeds according to the heuristics
deemed adequate to the problem until it either finds a solution or the work is
exhausted. Upon stealing work from a peer, a worker picks up the search at a
point that the worker it was stolen from would eventually reach, thus subverting
the problem's search strategy and introducing in it a measure of randomness.
This may be either beneficial or detrimental, depending on the specific problem.

 1: pool-put(search-space, variable)
 2:   pool.append(search-space, variable)
 3: pool-get()
 4:   if pool.size = 0 then
 5:     return steal-work()
 6:   else if pool.size < SAFE-SIZE then
 7:     lock(pool)
 8:     ss <- pool.remove-last()
 9:     unlock(pool)
10:     return ss
11:   else
12:     return pool.remove-last()

Fig. 6. Pool insertion and removal

13: steal-work()
14:   lock(stealing)
15:   v <- worker-with-biggest-pool()
16:   lock(v.pool)
17:   if v.pool.size < THRESHOLD then
18:     ss <- FAIL
19:   else
20:     ss <- v.pool.remove-first()
21:   unlock(v.pool)
22:   unlock(stealing)
23:   return ss

Fig. 7. Work stealing algorithm

In the event of an idle worker failing to obtain work within its team, it notifies
the team controller and waits, either to be later restarted or to be terminated.
When all the agents in a team have become idle, the team controller broadcasts
a request for more work to the other teams.

Inter-team work stealing follows a simple plan: initially, one of the team
controllers is given the role of fulfilling requests for work. Upon receiving one,
and using the same protocol used by the workers, it tries to steal a search space
from the local pool to be forwarded to the requester, which splits it among its
workers and becomes the new work supplier. If the designated work supplier is
unable to spare a search space, the remaining teams are polled for work, as done
in [13]. When no team is able to supply additional work, the idle team notifies
the main controller and terminates.



3.4 Implementation Notes 

One of the main goals behind this work was to build a constraint solver which 
could take advantage of the advances in parallel architectures and in clustering 
network technology. To better be able to handle the challenges inherent to mul- 
tiprocessing, namely memory management and caching issues, C was our choice 
for the implementation language, as it allows for very fine-grained control. 

A key idea behind the implementation is that of store. A store describes the 
domains of the variables of the problem and represents a sub-search space of 
the initial problem. Stores constitute the state of a worker and are meant to be 
self-contained and dense. The search spaces in the idle pools of each worker are 
represented by stores, which may be copied between workers and transmitted 
between teams, thus allowing the redistribution of the search. 

Teams are autonomous entities and each team corresponds to a distinct pro- 
cess, usually residing on a dedicated machine. As communication, particularly 
over a network, may have an adverse impact on system performance, care has 
been taken to minimise the number of inter-team messages needed. Teams are 
coordinated by way of an IPC library. 

A team comprises active components which are the workers and the con- 
troller. The controller is, most of the time, waiting for a worker or another team 
controller to communicate with it, not disturbing the search process and allow- 
ing workers to be mapped to processors. Workers are mostly independent from 
each other, except where work stealing is concerned, as explained in Section 3.3. 
A worker, to be able to steal work from another one without active cooperation 
from the latter, must be able to access all the team pools. To make this possible, 
pools are located in shared memory and workers, as well as the controller, are 
implemented as lightweight processes (threads). 

4 Experimental Results 

In this section, we present some performance results obtained with our solver on
three classic benchmark problems, namely the non-attacking queens problem, the
Golomb ruler problem [4, problem 006], and the Langford number problem [4,
problem 024]. Measurements were made of the time taken to count all solutions
for the three problems and to generate the first solution for the last problem.

These measurements were made on a cluster of Intel Core2 Quad Q6600
CPUs, clocked at 2.4GHz, with 2-4GB RAM, running Linux, and the code was
compiled with GCC 4.1.1 with the '-O3' flag. The times presented are the average
of the middle 10 times from 12 runs of each program. When computing the relative
performance with respect to the sequential case, we subtracted the overhead
associated with starting up and terminating the solver, which reached a maximum
of 0.2 seconds in the 6 teams configuration. Unless otherwise indicated,
teams are composed of 4 workers, mirroring the number of CPUs in the
shared-memory multiprocessor systems. For interprocess communication, the Open MPI
MPI-2 implementation [10] was used.
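
The way times are aggregated can be sketched as below. The helper names `middle_mean` and `speedup` are ours, and subtracting the overhead from the parallel running time is only one possible reading of the procedure described above.

```python
def middle_mean(times, drop=1):
    """Average of the runs left after discarding the `drop` fastest and the
    `drop` slowest ones (12 runs with drop=1 leave the middle 10)."""
    kept = sorted(times)[drop:len(times) - drop]
    return sum(kept) / len(kept)

def speedup(sequential, parallel, overhead=0.0):
    """Relative performance with respect to the sequential case, after
    removing start-up and termination overhead from the parallel time
    (an assumption of this sketch)."""
    return sequential / (parallel - overhead)
```

Discarding the extreme runs makes the reported average less sensitive to occasional disturbances on the cluster nodes.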



Absolute performance has not, so far, been the top priority goal of this work. 
Nevertheless the sequential (1 team with 1 worker) version of our solver already 
displays interesting times for solving these problems, as attested by Table 1, 
where they are compared with those of Gecode [7], although there clearly remains
some work to be done in that regard. 



Table 1. Times comparison with Gecode (seconds) 
The current implementation still suffers from some limitations which restrict
the range of problems we are able to run. One is the internal representation
of the domains, which only allows values between 0 and 63. Another is that
we do not deal with optimisation constraints. These are required by the typical 
formulation of the Golomb ruler problem where, to make up for their absence, 
we bound the domains of the variables from above by the known minimum value 
of the last mark of the ruler. 

In the remainder of this section, we look at the results obtained with several 
configurations of the solver and analyse them with respect to the speedups in- 
duced by the parallelisation of the search, using the two partitioning strategies. 
The use of the two strategies helps to highlight the effect of work stealing. 

To study the effects of the parallelisation of the solving procedure on the
Golomb ruler problem with 10 marks, we measured the amount of work associated
with each value from the domain of the variable $m_1$ corresponding to the
first positive mark of the ruler. This is the variable around which the problem
is partitioned.

Table 2 shows the weight of exploring all the tuples where variable $m_1$ takes
one value from its domain within the effort of exploring the whole search space.
For instance, the time taken to explore the subtree where $m_1$ has value 3 is
17.1% of the time needed to explore the whole tree, and the times for the first
three values of the domain together correspond to 66.5% of the total time. All
values between 10 and 55, which is the minimum length of a 10-mark Golomb
ruler, together account for about 1.4% of the total work effort. These values also
reflect the amount of pruning that takes place during the search.

From Figure 8, we can see that the speedup increase is roughly linear up to 3 
workers and improves up to 4 teams. Results are similar with both partitioning 
strategies, which speaks for the effectiveness of the work stealing implementation. 
For more than 4 teams, when total running times are around 0.7 seconds, the 
overhead of communication takes over. 



Table 2. Work distribution in the Golomb ruler problem (10 marks, all solu- 
tions), with respect to the value of the first positive mark 
Fig. 8. Speedups for the Golomb ruler (10 marks, all solutions)



For up to 6 teams, even partitioning will make the domain of $m_1$ in the first
team correspond to at least 98.6% of the work. Without work sharing this would
mean that the speedup would be bounded by 4. Work sharing allows it to reach
higher values, even with the simple scheme employed. Given the structure of the
problem, eager partitioning divides the work more evenly among the subproblems
and with 6 teams the work distribution will be approximately
25-24-17-12-8-14%. The fact that the speedups are similar in both cases is due
to work sharing evening out these differences. However, in the latter situation
speedups of around 8 should be possible, even without inter-team work sharing.

In the non-attacking queens problem, the first observation that can be made
about the speedups obtained, depicted in Figure 9, is that they are fairly
insensitive to the partitioning strategy used, which is a trend in the results
reported in this section. Given that in this problem the work is very evenly
distributed among the possible values from the domains of the variables, this
result is only possible due to effective work sharing.

The profile of the evolution of the speedups with the addition of more teams is
illustrative in this case. While it is quasi-linear for the 16 queens problem,
showing good scalability of the approach, the smaller problem starts suffering
from the weight of the implementation early on. Total running times for the three
problems in the 6 team setting are around 1.2, 4.8, and 27.5 seconds, for 14, 15,
and 16 queens, respectively.

Fig. 9. Speedups for the non attacking queens (all solutions)

The Langford number problem, for which we measured the speedups both for
counting all solutions and for obtaining the first solution, is an example of a
case where domain partitioning interacts badly with the heuristics usually used
for guiding the search, as dividing a domain gives rise to more work than that
needed to solve the original problem. This is apparent in Figure 10a, which
represents the results observed in finding the first solution and where some
instances of the problem displayed a marked slowdown when partitioning the
domain of the first variable in two or three similarly sized parts. On the other
hand, speedups of more than 3000 were also obtained in one case.

Counting all solutions of the Langford problem (Figure 10b) exhibits a profile 
common to the previous problems, where at some point the implementation 
starts overwhelming the potential improvements due to the parallelisation. This 
effect requires further study to identify and solve its causes. 
Fig. 10. Speedups for the Langford number problem

1 In these graphs, solid and dashed lines correspond, respectively, to even and eager
partitioning.

5 Related Work

Our aim is to find how best to take advantage of the available parallel
architectures for constraint solving, taking into account factors such as the
search and problem partitioning strategies, as well as communication and memory
access patterns which may help in minimising the overhead introduced by the
competition for resources involved in a parallel or distributed system and by the
coordination of the system components.

Recent years have seen an increase in the interest in parallel solving, as par- 
allel architectures become more common. An early language sporting parallel 
constraint solving was the CHIP parallel constraint logic programming language 
[15]. It was implemented on top of the logic programming system PEPSys, whose 
or-parallel resolution infrastructure was adapted to handle the domain opera- 
tions needed in parallel constraint solving. 

More recent works rely on features of an underlying framework for program- 
ming parallel search. The concurrent Oz language provides the basis for the 
implementation described in [13], where search is encapsulated into computa- 
tion spaces and a distributed implementation allows the distribution of workers. 
Work sharing is coordinated by a manager, which receives requests for work from 
the workers and then tries to find one willing to share the work it has left. Search 
strategies are user programmed and the work sharing strategy is implemented 
by the workers. 

A similar approach is taken in [8, 9] which show how to program parallel 
search controllers in Comet. There, the pool is an active object which is queried 
by the idle workers. In case the pool is empty, it asks another worker to generate 
yet unexplored sub-search spaces, gives one away and stores the rest. It is not 
explained, however, how the worker which supplies work is chosen. 

A focus of research has been on the strategies for splitting the work between 
workers. These strategies may be driven by the problem structure, such as the 
size of the domains [14], or by the past behaviour of the solver, be it related with 
properties of the solving process, such as the number of variables already instan- 
tiated [11], or with the progress of the search, in what it affects the prospects of 
finding a solution in the current subtree [18] or in the subtrees left to explore [3]. 

6 Conclusions and Future Work 

Parallelisation seems to be a natural way of improving the performance of CSP 
solving, and the results presented in this paper confirm the gains it may pro- 
duce. However, as the performance of sequential search for different problems is 
highly dependent on the heuristics used, it remains a challenge to identify the 
partitioning strategies which will be more appropriate to the parallel solving of 
each problem. While it may be tempting to adapt the sequential search heuris- 
tics to problem splitting, the granularity they induce on the sub-search spaces 
may be too fine and lead to too great a communication overhead in distributed 
settings. So, there is a tradeoff to be struck between how closely partitioning 
follows the 'optimal' search order for a given problem and the impact it has on 
the operation of the system. 

In [3], a scheme is presented which uses the search heuristics to guide problem
splitting, dampened by a degree of confidence to distribute the workers across
the search tree while maintaining some bias towards the nodes favoured by the
heuristic. It shows good performance on multi-core hardware, and while it has
the drawback of working on a global view of the search process, it seems to point
in a promising direction of research, namely using the work done as a guide to
future search space splitting.

In spite of the results obtained so far, there should be additional gains with 
a more sophisticated work sharing protocol. Several possibilities should be stud- 
ied, including having a different work stealing policy for inter-team sharing, 
where candidate search spaces undergo a deeper examination to try to deter- 
mine whether the cost of their sending is offset by the work saved locally. 

Planned developments of this work, besides tackling the defects and limita- 
tions identified in this text, comprise the inclusion of optimisation constraints 
and the improvement of the scalability of the implementation in two key aspects: 
the initial work distribution and the sharing of work between teams, which could 
both profit from organising the teams in multi-level neighbourhoods. 

We plan on experimenting with different underlying models and libraries for 
thread management and inter-process communication, namely to venture beyond 
the present implementation which relies on Posix threads and MPI. 